-
-
Notifications
You must be signed in to change notification settings - Fork 11k
[Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node #26338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…pans multi-node Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: GuanLuo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will look into it more in depth asap, in the meantime tagging other people that may be interested in reviewing @wseaton @markmc @njhill
PS to summarize: listener thread is moved from worker->scheduler. Scheduler aggregates metadata from all workers. Workers carry out handshake by fetching data from Scheduler (single port).
|
This does seem like a good way to get in the collective RPC and scheduler infa changes that #22274 needs, so I am supportive 🙂 |
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Guan Luo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the hard work @GuanLuo !
Let's leave a small window for other reviewers to chime in today, o/w we merge when CI is green again (unrelated failures should be fixed on main first).
| # Need to make sure the device ID is non-negative for NIXL, | ||
| # Torch uses -1 to indicate CPU tensors while NIXL uses explicit | ||
| # memory type. | ||
| self.device_id = max(cache.get_device(), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, what happens if all caches were indeed cpu tensors (not a use-case we have, but good to highlight) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the cache is CPU tensor, NIXL descriptor registration should still work as intended as self.nixl_memory_type is passed to get_reg_descs which is how NIXL differentiate if the memory is CPU / GPU
SGTM, this shouldn't impact changing handshake strategy in the future. |
|
@GuanLuo good to merge after conflict |
Signed-off-by: GuanLuo <[email protected]>
|
@NickLucche Resolved, kicked off CI |
Signed-off-by: GuanLuo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
…ce spans multi-node (vllm-project#26338) Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Signed-off-by: Guan Luo <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]>
…ce spans multi-node (vllm-project#26338) Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Signed-off-by: Guan Luo <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]>
…
Purpose
Fix NIXL handshake issue when model instance spans multiple nodes due to parallelism strategy (i.e. TP=16 and run on 2 H100x8), see #25981 for detail
Test Plan
test_nixl_connector.pyfor unit testing,test_kv_transfer_metadatato verify thatNIXLConnectorWorkerproperly return its handshake metadata and retrieve target metadata;NIXLConnectorSchedulercan serve the collective metadata.nixl_integration/**for integration testing, this covers ifEngineCoreproperly gather and serve the handshake metadata.Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.